Multi-label image classification is a fundamental yet challenging task in computer vision. Great progress has been made in recent years by exploiting semantic relations between labels. However, conventional approaches are unable to model the underlying spatial relations between labels in multi-label images, because spatial annotations of the labels are generally not provided. In this paper, we propose a unified deep neural network that exploits both semantic and spatial relations between labels with only image-level supervision. Given a multi-label image, our proposed Spatial Regularization Network (SRN) generates attention maps for all labels and captures the underlying relations between them via learnable convolutions. By aggregating the regularized classification results with the original results produced by a ResNet-101 network, classification performance is consistently improved. The whole deep neural network is trained end-to-end with only image-level annotations, thus requiring no additional annotation effort. Extensive evaluations on three public datasets with different types of labels show that our approach significantly outperforms state-of-the-art methods and has strong generalization capability. Analysis of the learned SRN model demonstrates that it can effectively capture both semantic and spatial relations of labels to improve classification performance.
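The flow described above — per-label attention maps over a shared feature map, learnable mixing across labels, and aggregation with the main classifier's scores — can be sketched as follows. This is a minimal numpy illustration under stated assumptions, not the paper's actual SRN: `w_att` stands in for the 1x1 convolutions that produce one attention map per label, and the single `w_rel` matrix stands in for the learnable convolutions that capture label relations; all names are hypothetical.

```python
import numpy as np

def spatial_softmax(x):
    """Normalize each (L, H, W) map over its spatial dimensions."""
    e = np.exp(x - x.max(axis=(1, 2), keepdims=True))
    return e / e.sum(axis=(1, 2), keepdims=True)

def srn_forward(features, w_att, w_rel, main_scores):
    """Hedged sketch of an SRN-style branch.

    features:    (C, H, W) feature map from a backbone (e.g. ResNet-101)
    w_att:       (L, C) hypothetical 1x1-conv weights, one attention map per label
    w_rel:       (L, L) hypothetical label-relation weights (stands in for the
                 learnable convolutions over stacked attention maps)
    main_scores: (L,) logits from the backbone's own classifier
    """
    # One spatially normalized attention map per label.
    att = spatial_softmax(np.einsum('lc,chw->lhw', w_att, features))   # (L, H, W)
    # Attention-weighted features, pooled spatially, per label.
    pooled = np.einsum('lhw,chw->lc', att, features)                    # (L, C)
    # Mix information across labels to obtain regularized scores.
    reg_scores = w_rel @ pooled.mean(axis=1)                            # (L,)
    # Aggregate regularized and original classification results.
    return 0.5 * (main_scores + reg_scores)
```

In the real model all of these weights would be learned end-to-end from image-level labels only; the sketch just shows how spatial attention and cross-label mixing compose into a score that complements the backbone's prediction.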